Comparison of s-gram Proximity Measures in Out-of-Vocabulary Word Translation
Authors
Abstract
Classified s-grams have been used successfully in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. In particular, s-grams have consistently outperformed other approximate string matching techniques, such as edit distance and n-grams. The Jaccard coefficient has traditionally been used as the s-gram based string proximity measure, and other proximity measures for s-gram matching have not been tested. The current study evaluated the performance of seven proximity measures for classified s-grams in a CLIR context using eleven language pairs. The binary proximity measures generally performed better than their non-binary counterparts, but the difference depended mainly on the padding used with the s-grams. When no padding was used, the binary and non-binary proximity measures performed nearly equally, although overall performance deteriorated.
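The abstract does not define the measures it compares, so the following is only a minimal sketch of the ideas it names, under stated assumptions: an s-gram is taken to be a pair of characters separated by a fixed number of skipped characters, padding is taken to be one pad character on each side of the word, the "binary" measure is the Jaccard coefficient over sets of s-grams, and the "non-binary" measure is its multiset (count-based) analogue. The function names, the default skip classes, and the way scores are pooled across classes are hypothetical choices for illustration, not the study's implementation.

```python
from collections import Counter


def s_grams(word, skip, pad=True, pad_char="_"):
    """Return character pairs of `word` with `skip` characters between them.

    Assumption: padding adds a single pad character to each end of the word.
    """
    if pad:
        word = pad_char + word + pad_char
    gap = skip + 1
    return [word[i] + word[i + gap] for i in range(len(word) - gap)]


def proximity(w1, w2, classes=((0,), (1, 2)), pad=True, binary=True):
    """Jaccard-style proximity between two words over classified s-grams.

    `classes` groups skip lengths; s-grams are only compared within a class.
    Assumption: intersections and unions are summed over classes before
    dividing, which is one plausible way to pool the classes.
    """
    shared = 0
    total = 0
    for skips in classes:
        g1 = Counter(g for k in skips for g in s_grams(w1, k, pad))
        g2 = Counter(g for k in skips for g in s_grams(w2, k, pad))
        if binary:
            # Binary variant: keep only presence or absence of each s-gram.
            g1 = Counter(set(g1))
            g2 = Counter(set(g2))
        shared += sum((g1 & g2).values())  # per-s-gram minimum counts
        total += sum((g1 | g2).values())   # per-s-gram maximum counts
    return shared / total if total else 0.0


# Repeated s-grams are where the binary and non-binary variants diverge.
binary_score = proximity("aaa", "aa", classes=((0,),), pad=False, binary=True)
count_score = proximity("aaa", "aa", classes=((0,),), pad=False, binary=False)
```

In this sketch the two variants agree whenever neither word repeats an s-gram within a class, which is why the example uses words with repeated characters to separate them.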
Similar resources
Statistical Machine Translation of Parliamentary Proceedings Using Morpho-Syntactic Knowledge
This paper presents an overview of the University of Washington statistical machine translation system developed for the 2006 TCSTAR evaluation campaign. We use a statistical phrase-based system with multiple decoding passes and a log-linear probability model. Our main focus was on exploring the possibility of using morpho-syntactic knowledge (lemmas and part-of-speech tags) for word alignment,...
Vocabulary Lists for EAP and Conversation Students
Despite the abundance of research investigating general and academic vocabularies and developing dozens of word lists, few studies have compared academic vocabulary with general service word lists such as conversation vocabulary. Many EAP researchers assume that university students need to know all the words in West’s (1953) General Service List (GSL) as a prerequisite to academic words (e.g., ...
Word Type Effects on L2 Word Retrieval and Learning: Homonym versus Synonym Vocabulary Instruction
The purpose of this study was twofold: (a) to assess the retention of two word types (synonyms and homonyms) in short-term memory, and (b) to investigate the effect of these word types on word learning by asking learners to learn their Persian meanings. A total of 73 Iranian language learners studying English translation participated in the study. For the first purpose, 36 freshmen from an ...
A Syllable-based Technique for Word Embeddings of Korean Words
Word embedding has become a fundamental component of many NLP tasks such as named entity recognition and machine translation. However, popular models that learn such embeddings are unaware of the morphology of words, so they are not directly applicable to highly agglutinative languages such as Korean. We propose a syllable-based learning model for Korean using a convolutional neural network, in wh...
Tamil to English Cross Lingual Information Retrieval System for Agricultural Domain Using VSM
Language processing is a prompt research area across the country. Within it, query translation has been one of the major areas of research for the past ten decades. Tamil is a morphologically rich and complex language. Suitable morphological processing is very important for Cross Lingual Information Retrieval (CLIR). The contributions towards Tamil to English query translation and transliteration are l...